data<-read.csv("data/yelp_data_reformat.csv")
Filter the data down to one row per business so that we don’t risk double counting or skewing results when analyzing the effect of business characteristics on average rating.
business_data <- data %>%
select(Business...Id,Business...Stars,Business...Review.Count,Business...Wi.Fi,Business...Waiter.Service,Business...Take.out,Business...Price.Range,Business...Parking,Business...Noise.Level,Business...Good.For.Kids,Business...Accepts.Credit.Cards,Business...Ages.Allowed,Business...Has.TV,Business...Categories) %>%
mutate(Price.Range = factor(Business...Price.Range),
Business...Review.Count = as.numeric(Business...Review.Count)) %>%
distinct()
head(business_data)
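Because distinct() removes only rows that are identical across every selected column, a business whose attributes differ between source rows could still appear more than once. A minimal sketch of a uniqueness check follows, run on a toy data frame; toy_business and its values are made up for illustration, and only the Business...Id column name is taken from the analysis.

```r
# Toy stand-in for business_data (hypothetical values)
toy_business <- data.frame(
  Business...Id    = c("a", "b", "b", "c"),
  Business...Stars = c(4.0, 3.5, 3.0, 5.0),
  stringsAsFactors = FALSE
)

# IDs that occur on more than one row after de-duplication
ids     <- toy_business$Business...Id
dup_ids <- unique(ids[duplicated(ids)])
n_dup   <- length(dup_ids)

# An empty result would mean there really is one row per business
print(dup_ids)
```

Applying the same check to business_data$Business...Id would show whether the one-row-per-business assumption holds before any modelling.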
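Next, perform some basic data cleaning and validation: check for constant columns, duplicated columns, and columns that are bijections of one another.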
constant_cols <- whichAreConstant(business_data)
[1] "whichAreConstant: it took me 0.04s to identify 0 constant column(s)"
double_cols <- whichAreInDouble(business_data)
[1] "whichAreInDouble: it took me 0.14s to identify 0 column(s) to drop."
bijections_cols <- whichAreBijection(business_data)
[1] "whichAreBijection: Price.Range is a bijection of Business...Price.Range. I put it in drop list."
[1] "whichAreBijection: it took me 0.39s to identify 1 column(s) to drop."
Perhaps more interesting than reviews, to both individual businesses and Yelp as a platform, is how reviews drive traffic and patronage. While the connection between high ratings and patronage may seem obvious, it would be ideal to find a metric that is a closer proxy for actual interaction. With that in mind, I’d like to look at the number of reviews as a dependent variable, since it is perhaps a closer proxy for how many people are actually visiting the establishment.
g<- ggplot(business_data,aes(x=Business...Review.Count)) +
geom_density(alpha=.3,fill="#D32323",color="#D32323")
ggplotly(g)
g2<-ggplot(business_data,aes(x=Business...Stars)) +
geom_histogram(alpha=.3,fill="#D32323",color="#D32323")
ggplotly(g2)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
business_reg <- lm(Business...Review.Count~Price.Range + Business...Stars + Business...Wi.Fi+Business...Noise.Level + Price.Range*Business...Wi.Fi+ Business...Good.For.Kids+Business...Has.TV,data=business_data)
summary(business_reg)
Call:
lm(formula = Business...Review.Count ~ Price.Range + Business...Stars +
Business...Wi.Fi + Business...Noise.Level + Price.Range *
Business...Wi.Fi + Business...Good.For.Kids + Business...Has.TV,
data = business_data)
Residuals:
Min 1Q Median 3Q Max
-261.36 -71.95 -18.84 38.52 1010.95
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -258.5275 51.2430 -5.045 5.99e-07 ***
Price.Range2 30.9450 37.5436 0.824 0.4101
Price.Range3 -40.5716 95.8266 -0.423 0.6722
Price.Range4 8.5114 60.3056 0.141 0.8878
Business...Stars 103.5992 10.7838 9.607 < 2e-16 ***
Business...Wi.Fifree 30.8771 33.5198 0.921 0.3573
Business...Wi.Fino 29.1342 29.2007 0.998 0.3188
Business...Wi.Fipaid 51.4933 69.7953 0.738 0.4609
Business...Noise.Levelaverage 16.8431 40.5900 0.415 0.6783
Business...Noise.Levelloud 0.9395 44.8121 0.021 0.9833
Business...Noise.Levelquiet -63.6865 42.3624 -1.503 0.1333
Business...Noise.Levelvery_loud -11.0146 50.0898 -0.220 0.8260
Business...Good.For.KidsTRUE -61.5745 13.2087 -4.662 3.86e-06 ***
Business...Has.TVTRUE -22.7518 10.9719 -2.074 0.0385 *
Price.Range2:Business...Wi.Fifree 77.5759 44.1901 1.756 0.0797 .
Price.Range3:Business...Wi.Fifree 115.5734 105.2879 1.098 0.2728
Price.Range4:Business...Wi.Fifree 76.5102 96.5360 0.793 0.4283
Price.Range2:Business...Wi.Fino 45.2604 39.6692 1.141 0.2543
Price.Range3:Business...Wi.Fino 105.7416 99.3487 1.064 0.2876
Price.Range4:Business...Wi.Fino NA NA NA NA
Price.Range2:Business...Wi.Fipaid NA NA NA NA
Price.Range3:Business...Wi.Fipaid NA NA NA NA
Price.Range4:Business...Wi.Fipaid NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 128.4 on 609 degrees of freedom
(5145 observations deleted due to missingness)
Multiple R-squared: 0.2946, Adjusted R-squared: 0.2737
F-statistic: 14.13 on 18 and 609 DF, p-value: < 2.2e-16
plot_model(business_reg)
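The note about coefficients "not defined because of singularities", together with the large number of rows dropped for missingness, is what one would expect if some Price.Range by Wi.Fi combinations have no complete observations left. A rough sketch of how to check this is below, using a toy data frame; the toy object and its column names price, wifi, and has_tv are hypothetical stand-ins for the model variables.

```r
# Toy data: one missing value removes the only price = 1 / wifi = "no" row
toy <- data.frame(
  price  = factor(c(1, 1, 2, 2, 3, 3)),
  wifi   = factor(c("free", "no", "free", "no", "free", "free")),
  has_tv = c(TRUE, NA, FALSE, TRUE, TRUE, FALSE)
)

# lm() silently drops rows with any NA in the model variables,
# so count cells on the complete cases only
used        <- toy[complete.cases(toy), ]
cell_counts <- table(price = used$price, wifi = used$wifi)

# Cells with zero observations cannot support an interaction coefficient
empty_cells <- which(cell_counts == 0, arr.ind = TRUE)
print(cell_counts)
```

Running the same tabulation on the complete cases of business_data would indicate which interaction terms the model has no data to estimate.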
business_reg <- lm(Business...Review.Count~Price.Range + Business...Stars + Business...Wi.Fi+Business...Noise.Level + Price.Range*Business...Wi.Fi,data=business_data)
summary(business_reg)
Call:
lm(formula = Business...Review.Count ~ Price.Range + Business...Stars +
Business...Wi.Fi + Business...Noise.Level + Price.Range *
Business...Wi.Fi, data = business_data)
Residuals:
Min 1Q Median 3Q Max
-137.83 -35.90 -10.68 16.10 1178.79
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -77.108 5.804 -13.285 < 2e-16 ***
Price.Range2 4.200 4.368 0.962 0.336320
Price.Range3 15.511 12.470 1.244 0.213580
Price.Range4 12.457 34.198 0.364 0.715675
Business...Stars 24.596 1.554 15.827 < 2e-16 ***
Business...Wi.Fifree 16.211 4.178 3.880 0.000106 ***
Business...Wi.Fino 13.991 3.491 4.008 6.21e-05 ***
Business...Wi.Fipaid 6.448 20.567 0.313 0.753918
Business...Noise.Levelaverage 26.721 3.752 7.122 1.20e-12 ***
Business...Noise.Levelloud 19.800 5.251 3.771 0.000165 ***
Business...Noise.Levelquiet -5.598 4.136 -1.353 0.175974
Business...Noise.Levelvery_loud 2.014 7.043 0.286 0.774955
Price.Range2:Business...Wi.Fifree 49.822 6.030 8.262 < 2e-16 ***
Price.Range3:Business...Wi.Fifree 74.585 18.772 3.973 7.18e-05 ***
Price.Range4:Business...Wi.Fifree 82.069 41.958 1.956 0.050517 .
Price.Range2:Business...Wi.Fino 44.020 5.334 8.252 < 2e-16 ***
Price.Range3:Business...Wi.Fino 70.727 15.284 4.627 3.79e-06 ***
Price.Range4:Business...Wi.Fino 83.851 39.520 2.122 0.033904 *
Price.Range2:Business...Wi.Fipaid 26.573 28.255 0.940 0.347014
Price.Range3:Business...Wi.Fipaid 9.150 41.650 0.220 0.826119
Price.Range4:Business...Wi.Fipaid NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 76.24 on 5441 degrees of freedom
(312 observations deleted due to missingness)
Multiple R-squared: 0.2276, Adjusted R-squared: 0.2249
F-statistic: 84.36 on 19 and 5441 DF, p-value: < 2.2e-16
plot_model(business_reg)
business_reg <- lm(Business...Stars~Price.Range + Business...Wi.Fi+Business...Noise.Level,data=business_data)
summary(business_reg)
Call:
lm(formula = Business...Stars ~ Price.Range + Business...Wi.Fi +
Business...Noise.Level, data = business_data)
Residuals:
Min 1Q Median 3Q Max
-2.58667 -0.44423 -0.07531 0.41333 1.92030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.079701 0.027082 113.718 < 2e-16 ***
Price.Range2 0.011354 0.018836 0.603 0.5467
Price.Range3 0.123862 0.055099 2.248 0.0246 *
Price.Range4 0.158233 0.122134 1.296 0.1952
Business...Wi.Fifree 0.188803 0.027175 6.948 4.14e-12 ***
Business...Wi.Fino 0.174516 0.024168 7.221 5.87e-13 ***
Business...Wi.Fipaid 0.002281 0.114320 0.020 0.9841
Business...Noise.Levelaverage 0.321096 0.032331 9.932 < 2e-16 ***
Business...Noise.Levelloud 0.053050 0.045684 1.161 0.2456
Business...Noise.Levelquiet 0.364531 0.035525 10.261 < 2e-16 ***
Business...Noise.Levelvery_loud -0.148999 0.061313 -2.430 0.0151 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6649 on 5450 degrees of freedom
(312 observations deleted due to missingness)
Multiple R-squared: 0.07131, Adjusted R-squared: 0.0696
F-statistic: 41.85 on 10 and 5450 DF, p-value: < 2.2e-16
plot_model(business_reg)
We want to look at the effect a business’s current rating has on its ability to attract new customers and to improve its overall rating. There are definitely better ways to do this, but for now, at each point in time for which we have data, we calculate the business’s average rating up to that point, the number of reviews it receives after that point, and its average rating after that point.
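The three quantities described above can be sketched with grouped cumulative summaries in dplyr. This is only an illustration on toy data, not necessarily how the analysis computes them: the reviews data frame and the column names business_id, review_date, stars, avg_to_date, reviews_after, and avg_after are all assumptions, and "up to that point" is taken here to include the current review.

```r
library(dplyr)

# Toy review-level data (hypothetical column names and values)
reviews <- data.frame(
  business_id = c("a", "a", "a", "a", "b", "b"),
  review_date = as.Date(c("2015-01-01", "2015-02-01", "2015-03-01",
                          "2015-04-01", "2015-01-15", "2015-02-15")),
  stars       = c(5, 3, 4, 2, 1, 3),
  stringsAsFactors = FALSE
)

rating_history <- reviews %>%
  group_by(business_id) %>%
  arrange(review_date, .by_group = TRUE) %>%
  mutate(
    # Running average rating up to and including this review
    avg_to_date   = cummean(stars),
    # Number of reviews this business receives after this one
    reviews_after = n() - row_number(),
    # Average rating of those later reviews; NA when there are none
    avg_after     = ifelse(reviews_after > 0,
                           (sum(stars) - cumsum(stars)) / reviews_after,
                           NA_real_)
  ) %>%
  ungroup()

print(rating_history)
```

Each row then pairs a business’s rating at one moment with what happened afterwards, which is the shape needed to regress later review volume or later rating on the current rating.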